# Load packages
library(bayesrules)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(rstanarm)
## Loading required package: Rcpp
## This is rstanarm version 2.21.1
## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
## - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
## - For execution on a local, multicore CPU with excess RAM we recommend calling
##   options(mc.cores = parallel::detectCores())
## 
## Attaching package: 'rstanarm'
## The following object is masked from 'package:rstan':
## 
##     loo
library(bayesplot)
## This is bayesplot version 1.8.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
##    * Does _not_ affect other ggplot2 plots
##    * See ?bayesplot_theme_set for details on theme setting
library(tidybayes)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(broom.mixed)

Exercise 10.1

  1. How fair is the model? We can evaluate this based on a series of smaller questions: 1) How was the data collected? 2) By whom and for what purpose was the data collected? 3) How might the results of the analysis, or the data collection itself, impact individuals and society? 4) What biases might be baked into this analysis?

  2. How wrong is the model? We can evaluate this by checking the model assumptions. Specifically, we want to ensure that three assumptions baked into our regression model are true: 1) Conditioned on X, the observed data Yi on case i is independent of the observed data on any other case j; 2) the typical Y outcome can be written as a linear function of X; 3) at any X value, Y varies normally around mu with consistent variability sigma.

  3. How accurate are the posterior predictive models? We can assess this by measuring the distance of the predicted mean and the observed mean, by calculating how may posterior SDs our observed mean is from the predicted mean, or by assessing which percentile of the predictive distribution the observed mean falls within.

Exercise 10.2

  1. The data was collected from consumer’s online shopping behavior without their consent.

  2. The data is collected to inform a algorithm that is used for price discrimination.

  3. A hiring algorithm discriminates against racial minorities.

  4. An algorithm used to calculate risk assessments (like morbidity estimates, and estimated level of care needed), which is based off of existing health care costs, underestimates the care that racial minorities need because they historically don’t receive as much access to treatment as white people. Therefore, an algorithm that assesses care needs based off data that drawn from a healthcare system that disproportionately serves white people will underestimate the health care needs and risk profiles of racial minorities.

Exercise 10.3

  1. I grew up in Massachusetts on the east coast of the U.S. and am white. I have benefited from immense privilege in terms of my race, familial background, and the educational opportunities I’ve had access to throughout my life.

  2. My perspective might limit my capacity to understand the implicit biases baked into either the data itself or the algorithm’s design. For instance, after reading the article about the health care algorithm’s mistaken risk assessments due to racial biases in health care data (https://www.science.org/doi/full/10.1126/science.aax2342), which I wouldn’t have necessarily thought about before, I am now going to try to interrogate the ways in which structural racism and discrimination generally might contaminate data that is otherwise “well-collected.” Especially with algorithms that have “normative” implications–i.e. are used to change society (even in ways as basic as informing hiring decisions)–it is incredibly important to ensure that training datasets represent the outcomes we’d like to see (this does not mean “proscribing” outcomes– it just means, once we get outcomes, assuring that they are not discriminatory, rather than assuming naively that the model works as we ideally hoped it would).

  3. I don’t really think my privilege gives me much comparative advantage in terms of spotting biases in model design. I think it is more something that I will have to keep checking less it lead to my making false assumptions or overlooking subtleties with significant implications.

Exercise 10.4

  1. There is no such thing as a neutral perspective. Everyone has some bias based on their identity, which informs how they perceive the world

  2. Your model was still created by you. Therefore, during data collection or analysis, it is possible that your biases were baked into your model. Furthermore, often data collected from the real world still contains bias due to structural discrimination. Therefore, simply using real-world data, regardless of your personal bias, can contaminate your model.

  3. I worked on a project that studied the opioid crisis in Massachusetts. Because I knew the regions in which it was worst, I was able to identify which counties to focus on.

Exercise 10.5

Models are attempts to simplify a complicated world into sets of rules designed by humans. But humans could never comprehend, much less articulate, the real world in all of its complexity. Therefore, all models are inherently going to be inaccurate, or perhaps, incomplete.

Exercise 10.6

  1. Conditioned on X, the observed data Yi on case i is independent of the observed data on any other case j; 2) the typical Y outcome can be written as a linear function of X; 3) at any X value, Y varies normally around mu with consistent variability sigma.

Exercise 10.7

  1. We simulate a Y outcome from the Normal data model tuned to this first parameter set (-1.8, 2.1, 0.8).

  2. This process is shown below:

beta_0 <- -1.8
beta_1 <- 2.1
sigma  <- 0.8
set.seed(84735)

x <- c(12, 10, 4, 8, 6)
y <- c(20, 17, 4, 11, 9)
data<- data.frame(x, y)

one_simulation <- data %>% 
  mutate(mu = beta_0 + beta_1 * x,
         simulated_y = rnorm(5, mean = mu, sd = sigma)) %>% 
  select(x, y, simulated_y) %>% 
  mutate(y_avg=mean(y), y_sim_avg=mean(simulated_y))
one_simulation

There is a pretty sizable difference between the predicted and observed Y values. The difference between y observed and y predicted ranges between 2 and 4.

Exercise 10.8

  1. The goal of this check is to compare the observed outcome values with the predictive model’s outputted values. We hope to observe minimal differences between these sets of values.

  2. One measurement of posterior prediction error is just the difference between the observed mean (or median) and the predicted mean (or median). This tells us about the means/medians of the observed and predicted values compare. Another measure is how many standard deviations our observed value falls from the posterior predicted mean/median; this tells us about how the distributions of the observed and predicted values compare. Lastly, there is the interval measure of error: we assess the proportion of observed values that fall within the 50% (or 95%) posterior prediction interval.

Exercise 10.9

  1. The median absolute error tells us about the distance the predictive median is from the observed median.

  2. The scaled median tells us about the typical number of standard deviations the observed Y values from their posterior predictive counterparts (Yi). This is useful because it is scaled to reflect the spread of the distribution, whereas the absolute distance is not. Therefore, the MAE scaled is more useful for cross-study comparisons of accuracy, where the accuracy of predictive models generally is assessed.

  3. The within-50 statistic tells us about the percentage of observed Y values that fall within the 50% posterior prediction interval. This encapsulates how much of our observed data falls within the majority distribution of our predicted data.

Exercise 10.10

  1. The darker density represents the actual observed data; the light color represents the posterior predicted data.

  2. If predicted Y values are similar in feature (shape of distribution) to the original Y data, we can believe our model assumptions are legitimate, and that we have chosen the right type of model. A good fitting model will produce a plot that follows the observed distribution because it is a “good fit”–i.e., it predicts values that are close to the observed values.

  3. If our model fits poorly, its pp_check() predicted plot line would be starkly divergent from the observed Y values.

Exercise 10.11

  1. x=abundance of anchovies in meal; y=Reem’s rating of meal

  2. Reem’s tastes (how much the presence of anchovies affects her rating of a meal)

  3. Train your model with a bunch of Reem’s previous ratings of meals. Test it on a new taco recipe.

  4. Cross-validation would help you develop a successful recipe because if you don’t use cross-validation (i.e. use full Reem dataset for both training and testing) you will produce a model that’s very good at predicting Reem’s tastes, but not very good at producing general predictions of good recipes. If we train and test on separate datasets (by splitting the data), we’re more likely to produce a successful recipe when we apply our model to new data.

Exercise 10.12

  1. Step 1: create folds: split data into k non-overlapping folds ranging from 2 to our original sample size (n); 2) train and test the model: train the model using the combined data in the first k-1 folds, test this model on the kth data fold, measure the prediction quality (e.g. using MAE); 3) repeat step 2 k-1 times, each time leaving ut a different fold for testing; 4) calculate cross-validation estimates: average the prediction measures from step 3 to obtain a single cross-validation estimate of prediction quality.

  2. If we use the same data to train and test a model, we risk “overfitting”–producing a model that is very good at performing on a particular dataset, but is not very generalizable.

  3. I wonder about how we can compare different iterations’ performances–e.g. those with high MAE vs low MAE, to understand what values the model is good/bad at predicting, and thus adapt our algorithm between training iterations to improve performance on those values for which it had high MAE.

Exercise 10.13

Downloading data

data("coffee_ratings")
coffee_ratings <- coffee_ratings %>% 
  select(farm_name, total_cup_points, aroma, aftertaste)

head(coffee_ratings)
  1. Aroma and aftertaste appear to be correlated with the farm the beans come from. Therefore, this data violates the independence assumption of the Bayesian regression model.

  2. Selecting one sample per farm

set.seed(84735)
new_coffee <- coffee_ratings %>% 
  group_by(farm_name) %>% 
  sample_n(1) %>% 
  ungroup()
dim(new_coffee)
## [1] 572   4

Exercise 10.14

  1. Plotting relationship between rating and aroma
ggplot(new_coffee, aes(y = total_cup_points, x = aroma)) + 
  geom_point(size = 0.2) + 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

  1. Using stan_glm() to simulate the normal regression model
coffee_model <- stan_glm(total_cup_points ~ aroma, data = new_coffee,
                       family = gaussian,
                       prior_intercept = normal(75, 10),
                       prior = normal(9, 0.1), 
                       prior_aux = exponential(0.001),
                       chains = 4, iter = 4000*2, seed = 84735)
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000106 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.06 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 1: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 1: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 1: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 1: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 1: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 1: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 1: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 1: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 1: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 1: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 1: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.223512 seconds (Warm-up)
## Chain 1:                0.397495 seconds (Sampling)
## Chain 1:                0.621007 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 4.3e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.43 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 2: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 2: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 2: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 2: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 2: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 2: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 2: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 2: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 2: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 2: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 2: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.166988 seconds (Warm-up)
## Chain 2:                0.429504 seconds (Sampling)
## Chain 2:                0.596492 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 2.4e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 3: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 3: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 3: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 3: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 3: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 3: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 3: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 3: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 3: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 3: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 3: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.214671 seconds (Warm-up)
## Chain 3:                0.425084 seconds (Sampling)
## Chain 3:                0.639755 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 2.1e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 4: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 4: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 4: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 4: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 4: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 4: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 4: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 4: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 4: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 4: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 4: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.156504 seconds (Warm-up)
## Chain 4:                0.395604 seconds (Sampling)
## Chain 4:                0.552108 seconds (Total)
## Chain 4:
  1. Visual and numerical summaries of B1:
# Trace plots of parallel chains
mcmc_trace(coffee_model, size = 0.1)

# Density plots of parallel chains
mcmc_dens_overlay(coffee_model)

Some summary statistics:

# Posterior summary statistics
tidy(coffee_model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.80)
  1. The posterior median of B1 is around 8.7. This means that the median increase in total cup points for a one unit increase in aroma is 8.7.

  2. Yes. Since the 80% confidence interval for B1 is entirely positive, we can safely assume that there is a posiitve relationship between aroma and rating.

Exercise 10.15

  1. Select the first set from our simulated runs:
coffee_df <- as.data.frame(coffee_model)
first_set <- head(coffee_df, 1)
first_set

Using the first parameter set to simulate 572 new coffee ratings from the observed aroma grades

beta_0 <- first_set$`(Intercept)`
beta_1 <- first_set$aroma
sigma  <- first_set$sigma
set.seed(84735)
one_simulation <- new_coffee %>% 
  mutate(mu = beta_0 + beta_1 * aroma,
         simulated_ratings = rnorm(572, mean = mu, sd = sigma)) %>% 
  select(aroma, total_cup_points, simulated_ratings)

one_simulation
  1. Constructing density plot
ggplot(one_simulation, aes(x = simulated_ratings)) + 
  geom_density(color = "lightblue") + 
  geom_density(aes(x = total_cup_points), color = "darkblue")

Light blue represents one simulated posterior dataset; dark blue represents the actual observed total cup points. It appears that the simulation’s distribution has a higher degree of variance than the observed data. The actual data is much more concentrated, and has a slightly higher median.

  1. Using pp_check()
pp_check(coffee_model, nreps = 50) + 
  xlab("total_cup_points")

Again, the observed distribution has much lower variance than the simulated distributions. Also, the observed distribution is “smoothed” – does not adequately capture the bumps in density on the left side of the distribution.

  1. I think our choice to use a Normal model is reasonable. Despite the small “bumps” along the left side, the observed data is normally shaped–however, though the observed data follows a normal distribution, our predictions still don’t accurately reflect the observed data’s shape. I think the deviance between our predictions and the observed distribution could result from the weakly informative priors we used in our posterior model.

Exercise 10.16

  1. Simulating and plotting a posterior model for the 7.67 aroma rating
#we'll use the same first set coefficients
beta_0 <- first_set$`(Intercept)`
beta_1 <- first_set$aroma
sigma  <- first_set$sigma

set.seed(84735)
predict_767 <- coffee_df %>% 
  mutate(mu = `(Intercept)` + aroma*7.67,
         y_new = rnorm(16000, mean = mu, sd = sigma))

# Plot the posterior predictive model
ggplot(predict_767, aes(x = y_new)) + 
  geom_density()

  1. Calculating raw error:
predict_767 %>% 
  summarize(mean = mean(y_new), error = 84 - mean(y_new))

Calculating standardized error:

predict_767 %>% 
  summarize(sd = sd(y_new), error = 84 - mean(y_new),
            error_scaled = error / sd(y_new))
  1. Constructing a ppc_intervals() plot:
set.seed(84735)
predictions <- posterior_predict(coffee_model, newdata = new_coffee)
dim(predictions)
## [1] 16000   572
ppc_intervals(new_coffee$total_cup_points, yrep = predictions, x = new_coffee$aroma, 
              prob = 0.5, prob_outer = 0.95)

The dark blue datapoints represent the observed values. The light blue dots are the posterior predictive medians for each aroma (x). The 50% prediction intervals are the thicker blue bars; the 95% prediction intervals are the narrow blue bars. Since the observed dark blue dots overlap with the 95% confidence intervals for most aroma values, and with the 50% intervals for an even greater majority of aroma values, this plot tells us that the posterior model is producing pretty accurate predictions.

  1. How many batches have observed ratings that are within their 50% posterior prediction interval?
set.seed(84735)
prediction_summary(coffee_model, data = new_coffee)

64% of batches fall within the 50% posterior prediction interval.

Exercise 10.17

  1. Using prediction_summary_cv()
set.seed(84735)
cv_procedure <- prediction_summary_cv(
  model = coffee_model, data = new_coffee, k = 10)
cv_procedure$folds
cv_procedure$cv
  1. The MAE measures the typical difference between the observed and predicted Y values. The average MAE is 1.08. The scaled median absolute error measures the typical number of standard deviations that the observed Y values are from their predicted counterparts. The average MAE scaled is 0.51. The average within_50 percent across all 10 folds is 62%; this means that on average, 62% of batches fall within their 50% prediction interval. The average within_95 percent across all 10 folds is 96%; this means that on average, 96% of batches fall within their 95% prediction interval.

  2. We can compare these values from the 10 folds to those that arise when we run the summary on the entire dataset:

set.seed(84735)
prediction_summary(coffee_model, data = new_coffee)

These values are quite similar, which suggests that our 10-fold split worked as well as training the data on the entire dataset.

Exercise 10.18

We can answer this question by asking: 1) how was the data collected? (we don’t know this); 2) by whom and for what purpose was data collected? we don’t know who collected the data, but presumably it was for the purpose of evaluating the ratings of coffee beans based on which farm they come from, which might or might not have discriminatory implications (depending on the application of our model, if this algorithm is used to inform which farms are given a higher advertising budget, which farms are more highly exported, etc, then it could have discriminatory repurcussions); 3) what biases may be baked into the analysis? we need to know the sample population who produced the ratings. If the population that gave rise to the data is skewed, then the model will also not be representative of a broad set of demographics tastes.

Exercise 10.19

  1. Simulating normal regression model with aftertaste:
coffee_model_2 <- stan_glm(total_cup_points ~ aftertaste, data = new_coffee,
                       family = gaussian,
                       prior_intercept = normal(75, 10),
                       prior = normal(9, 0.1), 
                       prior_aux = exponential(0.001),
                       chains = 4, iter = 4000*2, seed = 84735)
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 2.6e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.26 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 1: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 1: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 1: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 1: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 1: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 1: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 1: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 1: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 1: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 1: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 1: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.181428 seconds (Warm-up)
## Chain 1:                0.364013 seconds (Sampling)
## Chain 1:                0.545441 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 7.1e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.71 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 2: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 2: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 2: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 2: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 2: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 2: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 2: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 2: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 2: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 2: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 2: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.178169 seconds (Warm-up)
## Chain 2:                0.385859 seconds (Sampling)
## Chain 2:                0.564028 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 2e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 3: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 3: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 3: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 3: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 3: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 3: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 3: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 3: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 3: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 3: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 3: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.184629 seconds (Warm-up)
## Chain 3:                0.387418 seconds (Sampling)
## Chain 3:                0.572047 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 2.3e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.23 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 4: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 4: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 4: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 4: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 4: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 4: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 4: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 4: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 4: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 4: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 4: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.165953 seconds (Warm-up)
## Chain 4:                0.382858 seconds (Sampling)
## Chain 4:                0.548811 seconds (Total)
## Chain 4:
  1. Plotting to see how model compares to observations
pp_check(coffee_model_2, nreps = 50) + 
  xlab("total_cup_points")

The model looks like it generally follows the same median as the observed distribution. However, the observed distribution has a much lower level of variance than the predicted posterior model. Also, the predicted model again fails to capture the set of “bumps” on the left tail of the distribution. Therefore, I would say the normal model is not the best fit, or our priors could perhaps be improved.

  1. Obtaining 10 cross-fold validation metrics:
set.seed(84735)
cv_procedure <- prediction_summary_cv(
  model = coffee_model_2, data = new_coffee, k = 10)
cv_procedure$folds
cv_procedure$cv
  1. Which is the better predictor, aroma or aftertaste? It appears that aftertaste has a lower MAE and scaled MAE, and a comparable within_50/within_95 percentage. Therefore, I would say that aftertaste is the better predictor of the two.

Exercise 10.20

  1. Fitting the model:
data("weather_perth")
weather_data <- weather_perth %>% 
  select(maxtemp, mintemp)

head(weather_data)

Plot it quickly to get a sense of priors

ggplot(weather_data, aes(y = maxtemp, x = mintemp)) + 
  geom_point(size = 0.2) + 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

weather_model <- stan_glm(maxtemp ~ mintemp, data = weather_data,
                       family = gaussian,
                       prior_intercept = normal(10, 2),
                       prior = normal(2, 0.1), 
                       prior_aux = exponential(0.001),
                       chains = 4, iter = 4000*2, seed = 84735)
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 2.3e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.23 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 1: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 1: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 1: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 1: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 1: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 1: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 1: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 1: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 1: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 1: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 1: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.154996 seconds (Warm-up)
## Chain 1:                0.549567 seconds (Sampling)
## Chain 1:                0.704563 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 2.6e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.26 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 2: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 2: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 2: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 2: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 2: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 2: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 2: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 2: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 2: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 2: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 2: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.156846 seconds (Warm-up)
## Chain 2:                0.556455 seconds (Sampling)
## Chain 2:                0.713301 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 1.9e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 3: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 3: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 3: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 3: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 3: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 3: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 3: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 3: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 3: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 3: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 3: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.144368 seconds (Warm-up)
## Chain 3:                0.530713 seconds (Sampling)
## Chain 3:                0.675081 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 1.9e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 4: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 4: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 4: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 4: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 4: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 4: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 4: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 4: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 4: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 4: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 4: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.148983 seconds (Warm-up)
## Chain 4:                0.520674 seconds (Sampling)
## Chain 4:                0.669657 seconds (Total)
## Chain 4:
  1. Checking out some summary stats:
# Posterior summary statistics
tidy(weather_model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.80)

The for every 1 degree increase in mintemp, max temp goes about by 0.91 on average. At a min temp of 0, the max temp is about 13.4 on average. There is a high degree of variance in the distribution’s spread around the mean (average SD==4.31), meaning the predicted outcomes are fairly widely varied.

  1. First, we will produce some plots:
pp_check(weather_model, nreps = 50) + 
  xlab("maxtemp")

set.seed(84735)
predictions <- posterior_predict(weather_model, newdata = weather_data)
dim(predictions)
## [1] 16000  1000
ppc_intervals(weather_data$maxtemp, yrep = predictions, x = weather_data$mintemp, 
              prob = 0.5, prob_outer = 0.95)

Based on the first plot (ppcheck()) the posterior prediction seems to miss a key downward blip in the distribution, which almost makes me think that a normal model isn’t a good fit for this data. The ppc_intervals plot also looks very different from the other one we produced–while much of the data fall within the 95% interval, very little falls within the 50% interval.

set.seed(84735)
prediction_summary(weather_model, data = weather_data)

This is confirmed in our statistical check–the within_50 percentage is very low (45%). Based on the model’s deviance from the observed distribution, and our summary statistics, I would say that for this bimodal distribution, perhaps we have attempted to combine two normal distributions (each with its own mean); therefore, perhaps a single normal model isn’t the best fit for this dataset.

Exercise 10.21

  1. Tuning model for rides by humidity
data("bikes")
bikes <- bikes %>% 
  select(rides, humidity)

head(bikes)

Plot it quickly to get a sense of priors

ggplot(bikes, aes(y = rides, x = humidity)) + 
  geom_point(size = 0.2) + 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

bikes_model <- stan_glm(rides ~ humidity, data = bikes,
                       family = gaussian,
                       prior_intercept = normal(4000, 100),
                       prior = normal(10, 1), 
                       prior_aux = exponential(0.01),
                       chains = 4, iter = 4000*2, seed = 84735)
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 4.3e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.43 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 1: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 1: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 1: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 1: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 1: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 1: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 1: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 1: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 1: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 1: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 1: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.221353 seconds (Warm-up)
## Chain 1:                0.389746 seconds (Sampling)
## Chain 1:                0.611099 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 2.6e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.26 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 2: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 2: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 2: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 2: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 2: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 2: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 2: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 2: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 2: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 2: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 2: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.411148 seconds (Warm-up)
## Chain 2:                0.368995 seconds (Sampling)
## Chain 2:                0.780143 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 2.5e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 3: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 3: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 3: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 3: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 3: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 3: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 3: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 3: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 3: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 3: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 3: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.23504 seconds (Warm-up)
## Chain 3:                0.352951 seconds (Sampling)
## Chain 3:                0.587991 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 4.2e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.42 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 8000 [  0%]  (Warmup)
## Chain 4: Iteration:  800 / 8000 [ 10%]  (Warmup)
## Chain 4: Iteration: 1600 / 8000 [ 20%]  (Warmup)
## Chain 4: Iteration: 2400 / 8000 [ 30%]  (Warmup)
## Chain 4: Iteration: 3200 / 8000 [ 40%]  (Warmup)
## Chain 4: Iteration: 4000 / 8000 [ 50%]  (Warmup)
## Chain 4: Iteration: 4001 / 8000 [ 50%]  (Sampling)
## Chain 4: Iteration: 4800 / 8000 [ 60%]  (Sampling)
## Chain 4: Iteration: 5600 / 8000 [ 70%]  (Sampling)
## Chain 4: Iteration: 6400 / 8000 [ 80%]  (Sampling)
## Chain 4: Iteration: 7200 / 8000 [ 90%]  (Sampling)
## Chain 4: Iteration: 8000 / 8000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.662984 seconds (Warm-up)
## Chain 4:                0.357115 seconds (Sampling)
## Chain 4:                1.0201 seconds (Total)
## Chain 4:
  1. Checking out some summary stats:
# Posterior summary statistics
tidy(bikes_model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.80)

The for every 1 unit increase in humidity, number of rides goes about by 9.23 on average. At a humidity level of 0, the number of rides is 3068 on average. There is a high degree of variance in the distribution’s spread around the mean (average SD==1578), meaning the predicted outcomes are fairly widely varied.

  1. First, we will produce some plots:
pp_check(bikes_model, nreps = 50) + 
  xlab("humidity")

Again, the predicted posterior neglects to account for some bimodal-ness in the distribution. However, this time, the variance estimated is closer to the observed level, and the predicted median is more closely aligned with the observed median.

set.seed(84735)
predictions <- posterior_predict(bikes_model, newdata = bikes)
dim(predictions)
## [1] 16000   500
ppc_intervals(bikes$rides, yrep = predictions, x = bikes$humidity, 
              prob = 0.5, prob_outer = 0.95)

These 50% and 95% posterior predictive intervals are very large because our estimated sigma is so large. Most of our observed values fall within the 95% predictive intervals, but not within the 50% intervals (as shown more clearly in the statistics below). It is not problematic that the intervals are so long because this captures the true spread of the data–our model is correct to predict such a high level of variance, because humidity and ridership are not that strongly correlated.

set.seed(84735)
prediction_summary(bikes_model, data = bikes)

This is an example in which the MAE and MAE scaled vary greatly, because the sigma is so large. Here, it’s more useful to look at MAE scaled to get a sense of how much error we’re dealing with. I’m honestly not sure if I should say that the normal model is a good fit here because the distribution has bimodal tendencies. I don’t know of a model that best approximates such a distribution…so perhaps normal is not the best but it is ok or the best we can manage.